The use of error tags in ARTFL's Encyclopédie: Does good error identification lead to good error correction?
Abstract
Many corpora that are prime candidates for automatic error correction, such as the output of OCR software and electronic texts incorporating markup tags, include information on which portions of the text are most likely to contain errors. This paper describes how the error markup tag is being incorporated in the spell-checking of an electronic version of Diderot's Encyclopédie, and evaluates whether the presence of this tag has significantly aided in correcting the errors which it marks. Although the usefulness of error tagging may vary from project to project, even as the precise way in which the tagging is done varies, error tagging does not necessarily confer any benefit in attempting to correct a given word. It may, of course, nevertheless be useful in marking errors to be fixed manually at a later stage of processing the text.

1 The Encyclopédie

1.1 Project Overview

The goal of this project is ultimately to detect and correct all errors in the electronic version of the 18th-century French encyclopedia of Diderot and d'Alembert, a corpus of ca. 18 million words. This text is currently under development by the Project for American and French Research on the Treasury of the French Language (ARTFL); a project overview and limited sample of searchable text from the Encyclopédie are available at: Andreev et al. (1999) also provide a thorough summary of the goals and status of the project.

The electronic text was largely transcribed from the original, although parts of it were produced by optical character recognition on scanned images. Unfortunately, whether a section of text was transcribed or produced by OCR was not recorded at the time of data capture, so the error correction strategy cannot be made sensitive to this parameter. Judging from a small hand-checked section of the text, the error rate is fairly low: about one word in 40 contains an error.
It should also be added that the version of the text with which I am working has already been subjected to some corrective measures after the initial data capture stage. For example, common and easily identifiable mistakes such as the word enfant showing up as ensant were simply globally repaired throughout the text. (The original edition of the Encyclopédie made use of the sharp 's', which was often confused with an 'f' during data entry; cf. Figure 1.) At present, my focus is on non-word error detection and correction, …
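The global repair step mentioned above can be pictured with a minimal Python sketch. It is an illustration only, not the procedure ARTFL actually used: the table `GLOBAL_REPAIRS`, the function `apply_global_repairs`, and the whole-word matching rule are all assumptions, and the single entry is the 's'/'f' misreading of enfant.

```python
import re

# Hypothetical table of known misreadings and their intended words.
# Only the enfant example is taken from the text; the table itself is assumed.
GLOBAL_REPAIRS = {"ensant": "enfant"}


def apply_global_repairs(text, repairs=GLOBAL_REPAIRS):
    """Replace each known misreading wherever it occurs as a whole word."""
    if not repairs:
        return text
    # Build one alternation of all misreadings, anchored at word boundaries
    # so that real words containing the same letters are left alone.
    alternation = "|".join(re.escape(wrong) for wrong in repairs)
    pattern = re.compile(r"\b(" + alternation + r")\b")
    return pattern.sub(lambda match: repairs[match.group(1)], text)
```

Matching on word boundaries is a deliberate choice in this sketch: a plain substring replacement would also rewrite legitimate words such as pensant. Capitalised forms are not handled here and would need their own entries.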
Publication date: 2000